Abstract- In this paper, we investigate the chaotic behavior of
biological sequences across different species. We characterize
the biological sequences by their moment invariants,
correlation dimension, and largest Lyapunov exponent estimates.
We applied our model to a number of human and mouse genome
sequences encoded into sets of integers (time series) using a
plain table mapping scheme. Our results indicate that these
nonlinear dynamical characteristics differ significantly between
the sequences of the two species, which allowed us to classify
the genome sequences according to their chaotic parameter
estimates. Our investigation also suggests that chaotic modeling
of biological sequences could open new frontiers in sequence
similarity search techniques.
I. Introduction

The goal of biological sequence alignment is to
identify regions of similarity (often interpreted as homology)
between two or more sequences, and to associate these regions
with one another to enable further comparisons. Existing
algorithms for sequence alignment and validation are
adequate for many problems. However, even if a complete
solution for sequence alignment were available, mathematical
or statistical optimality and biological optimality are not
equivalent, due to the inevitable violations of implicit or
explicit evolutionary models [7], [8].

These challenges have led to the development of new
methods for similarity detection. In this work, the genome
sequence is encoded as a nonlinear dynamical time series
(signal), from which features are extracted by different
techniques. Such techniques transform the mostly qualitative
diagnostic criteria into a more objective, quantitative signal
feature classification problem. Classical techniques that have
been used to address this problem include similarity detection
using the autocorrelation function [1], frequency domain
features [2], time frequency analysis [3], and the wavelet
transform [4], [5]. Other techniques used adaptive filtering
[6], sequential hypothesis testing [7], [8], as well as
morphological features. Even though fairly good results have
been obtained using such techniques, they provide only a
limited amount of information about the signal because they
ignore the underlying nonlinear signal dynamics.

In the last two decades, there has been increasing
interest in applying techniques from nonlinear analysis and
chaos theory to the study of biological systems [9].
In chaotic dynamical system theory, several features can be
used to describe system dynamics, including moment
invariants, correlation dimension (D2) and Lyapunov
exponents. Several studies have used these features to explain
signal behavior [12]; in this work, they are applied to genome
sequences encoded as signals. In this paper, we address the
problem of characterizing the nonlinear dynamics of our
sequences. We discuss the implementation details for
automatically computing three important chaotic system
parameters, namely the moment invariants, correlation
dimension and largest Lyapunov exponent, using the Open
TSTool MATLAB package. The proposed implementations were
used to compute these features for twenty independent
sequence-encoded time series signals belonging to two
different genomes, human and mouse,
downloaded from the Matrix Science - Help - Sequence
Database Setup - IPI [10]. The results are studied to detect
statistically significant differences between the genome
types. Finally, statistical classification techniques such as
K-means clustering are used to assess the possibility of
similarity detection and classification using such parameters.

II. Methodology 

A. Phase Space Trajectory Reconstruction. 

In this section we briefly outline the basic steps of
chaotic time series analysis. We start by encoding the
different genome sequences into time series signals, as shown
in figure (1.a, b). A good choice for the delay time is the
first minimum of the auto mutual information function, as
shown in figure (1.c, d); this first minimum is found at a
delay of four. Next, we need the minimal embedding dimension
for both the human and mouse time series signals. We use Cao's
method with a delay time of four, a maximal dimension of
eight, three nearest neighbors, and a number of reference
points that depends on the length of each signal. The graph
produced by Cao's method, shown in figure (1.e, f), has a kink
at three. We therefore use a time delay reconstruction of the
human and mouse time series signals with embedding dimension 3
and delay 4. Finally, we plotted the phase space trajectory
for both signals, as shown in figure (1.g, h). The step that
follows obtaining the phase trajectories is feature
extraction, which can be done by applying the following three
methods:

1. Moment invariants. 

2. Correlation dimension. 

3. Lyapunov exponent. 
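The encoding and time delay reconstruction steps described above can be sketched in Python as follows. This is a minimal illustration, not the TSTOOL implementation used in this work: the lookup table (20 amino-acid letters mapped to 1..20 in alphabetical order), the example sequence, and the function names `encode_sequence` and `delay_embed` are all assumptions made for the example. Only the embedding dimension 3 and delay 4 come from the text.

```python
import numpy as np

# Hypothetical plain lookup table (the paper does not list its table):
# the 20 standard amino-acid letters mapped to 1..20 in alphabetical order.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TABLE = {letter: code for code, letter in enumerate(ALPHABET, start=1)}


def encode_sequence(sequence):
    """Map a sequence string to an integer-valued time series."""
    return np.array([TABLE[ch] for ch in sequence.upper()], dtype=float)


def delay_embed(series, dim=3, tau=4):
    """Time delay reconstruction: row t is (x[t], x[t+tau], ..., x[t+(dim-1)*tau])."""
    series = np.asarray(series, dtype=float)
    n_points = len(series) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("series is too short for this dim and tau")
    columns = [series[i * tau: i * tau + n_points] for i in range(dim)]
    return np.column_stack(columns)


# Made-up 15-letter sequence, used only to exercise the functions.
signal = encode_sequence("MKVLAAGIVGLCAKQ")
trajectory = delay_embed(signal, dim=3, tau=4)
print(trajectory.shape)  # 15 - (3 - 1) * 4 = 7 points in 3 dimensions
```

Each row of `trajectory` is one point of the reconstructed phase space trajectory, so plotting its three columns against each other gives a plot of the kind shown in figure (1.g, h).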

1) Moment Invariants: The mathematical description of a
dynamical system consists of two parts: the state, which is a
snapshot of the process at a given instant in time, and the
dynamics, which is the set of rules by which the states evolve
over time. To study the dynamics of our system, we first
need to reconstruct the state space trajectory.




Figure (1), basic steps of analyzing a chaotic time series
system: (a, b) human and mouse time series signals; (c, d) delay
time using the first minimum of the auto mutual information;
(e, f) minimum embedding dimension using Cao's method; (g, h)
phase trajectories of the human and mouse time series signals.

The most common method to do this is to use the delay time
embedding theorem to create a higher-dimensional geometric
object by embedding the signal into an m-dimensional embedding
space [1]. The embedding dimension m must be large enough
for delay time embedding to work. When a suitable value of m
is used, the orbits of the system do not cross each other. The
dimension m at which false neighbors disappear is the
smallest dimension that can be used for the given data. The
data is then ready for feature extraction by the moment
invariants. These invariants are constructed using the
generalized fundamental theorem of moment invariants
(GFTMI), which was formulated in [1]. In 1962, Hu [2]
presented the fundamental theorem of moment invariants
(FTMI) for recognition of two dimensional images subjected
to general linear transformation. Only in 1991, after 21 years
of publication [3], the CFTMI was formulated by another
author [4]. Features obtained by moment invariants are
simply calculated features that do not change under
translation, scaling or rotation. The following equations
calculate the seven features extracted from the ten human and
ten mouse time series signals.
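As a reference point, the seven invariants in Hu's two-dimensional formulation can be written as follows, in terms of the central moments \mu_{pq} and the normalized central moments \eta_{pq}. This is the standard textbook form and is an assumption about the form used here; the symbols \phi_1 to \phi_7 are introduced only for illustration, and a generalized form may be needed for a three-dimensional trajectory.

```latex
\begin{align}
\eta_{pq} &= \frac{\mu_{pq}}{\mu_{00}^{\,1+(p+q)/2}}, \qquad p+q \ge 2 \\
\phi_1 &= \eta_{20} + \eta_{02} \\
\phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
\phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
\phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
\phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})
          \left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \notag \\
       &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})
          \left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
\phi_6 &= (\eta_{20} - \eta_{02})
          \left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
          + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
\phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})
          \left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \notag \\
       &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})
          \left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{align}
```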
2) Correlation Dimension Estimation: The correlation
dimension provides a straightforward way to
measure the spatial organization, and hence the
predictability (finite dimensionality) and
dimensionality, of a given signal (time series). That
is, the correlation dimension provides a way to
determine whether the signal (time series) has a
fractional dimension, i.e. a chaotic attractor.
To estimate the correlation dimension, the state space
trajectory of the time series must first be
reconstructed. This can be accomplished using the
delay time embedding theorem by embedding the signal
into an m-dimensional embedding space. The embedding
dimension m should be large enough for delay time
embedding to work. When a suitable value for m is
used, the orbits of the system do not cross each other
[1]. We selected the first minimum of the mutual
information function as the embedding time lag, and
estimated the embedding dimension using Cao's
method [6]. We then computed the correlation
dimension using the Takens estimator provided with
the TSTOOL add-on toolbox for MATLAB.
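To make the estimation step concrete, the following Python sketch implements Takens' maximum-likelihood estimator of the correlation dimension on a set of reconstructed state vectors. It is an illustration under stated assumptions, not the TSTOOL routine: the function name `takens_estimator`, the cutoff radius `r0`, and the synthetic test data are all chosen for the example, and no correction for temporally close pairs is applied.

```python
import numpy as np


def takens_estimator(points, r0):
    """Takens' estimate of the correlation dimension.

    points: (n, m) array of reconstructed state vectors.
    r0: upper cutoff on the inter-point distance (an assumed tuning choice).
    Returns -1 / mean(log(r / r0)) over all pairs with 0 < r < r0.
    """
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    upper = np.triu_indices(len(points), k=1)  # count each pair once
    r = dist[upper]
    # Coincident points (r == 0) would give log(0), so they are dropped.
    r = r[(r > 0) & (r < r0)]
    if r.size == 0:
        raise ValueError("no point pairs closer than r0")
    return -1.0 / np.mean(np.log(r / r0))


# Synthetic sanity checks: points on a line embedded in 3-D should give a
# value near 1, points filling a square a value near 2.
rng = np.random.default_rng(0)
t = rng.random(500)
line = np.column_stack([t, 2.0 * t, -t])
square = rng.random((800, 2))
d_line = takens_estimator(line, r0=0.2)
d_square = takens_estimator(square, r0=0.1)
print(d_line, d_square)
```

Integer-encoded sequences produce many exactly coincident state vectors, which is why zero distances are excluded before taking the logarithm. The choice of `r0` strongly affects the estimate and would need to be tuned per signal.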

3) Lyapunov exponent: The notion of Lyapunov
exponents generalizes the idea of eigenvalues as a
measure of the stability of a fixed point
(characteristic exponents), as it provides a measure
of the stability of a periodic orbit. That is, the
Lyapunov spectrum (exponents) characterizes the
behavior (contraction or expansion) of the
trajectories close to a fixed point. These exponents
therefore provide a means to measure the sensitivity
to perturbed initial conditions. For a system to
undergo chaotic dynamics, it must have at least one
positive Lyapunov exponent. The largest Lyapunov
exponent (lambda1) may thus be regarded as an
estimator of the dominant chaotic behavior of the
system [1]. In this work we have used the TSTOOL
largest Lyapunov exponent estimation algorithm.
This algorithm is similar to Wolf's algorithm and
provides an efficient estimate of the largest
Lyapunov exponent by calculating the scaling (rate
of increase) of the prediction error (separation of
nearby trajectories) versus the prediction time.
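The idea of reading the largest Lyapunov exponent from the growth of the separation between nearby trajectories can be sketched in Python as below. This is a simplified nearest-neighbor divergence estimate written for illustration; it is not the TSTOOL algorithm. The function names, the `horizon` and `theiler` parameters, and the logistic-map test signal are assumptions introduced for the example.

```python
import numpy as np


def delay_embed(series, dim, tau):
    """Time delay reconstruction of a scalar series."""
    series = np.asarray(series, dtype=float)
    n_points = len(series) - (dim - 1) * tau
    return np.column_stack([series[i * tau: i * tau + n_points] for i in range(dim)])


def largest_lyapunov(series, dim, tau, horizon, theiler):
    """Slope of the mean log separation of neighbors versus prediction time.

    horizon: number of steps over which each neighbor pair is followed.
    theiler: minimum time separation between a point and its neighbor.
    """
    pts = delay_embed(series, dim, tau)
    n = len(pts) - horizon  # points that can be followed for `horizon` steps
    base = pts[:n]
    dist = np.linalg.norm(base[:, None, :] - base[None, :, :], axis=2)
    idx = np.arange(n)
    # Exclude temporally close points and exact duplicates as neighbors.
    dist[np.abs(idx[:, None] - idx[None, :]) <= theiler] = np.inf
    dist[dist == 0] = np.inf
    neighbor = np.argmin(dist, axis=1)
    log_div = np.zeros(horizon + 1)
    for k in range(horizon + 1):
        sep = np.linalg.norm(pts[idx + k] - pts[neighbor + k], axis=1)
        log_div[k] = np.mean(np.log(sep[sep > 0]))
    slope = np.polyfit(np.arange(horizon + 1), log_div, 1)[0]
    return slope, log_div


# Test signal (an assumption, not data from the paper): the logistic map at
# r = 4, whose largest Lyapunov exponent is known to be ln 2.
x = np.empty(1500)
x[0] = 0.3
for i in range(1, len(x)):
    x[i] = 4.0 * x[i - 1] * (1.0 - x[i - 1])
lle_logistic, curve = largest_lyapunov(x, dim=1, tau=1, horizon=5, theiler=5)
print(lle_logistic)
```

In practice the slope should be fitted only over the range where the curve `log_div` is approximately linear, which corresponds to the manual reading of the scaling region described in the results.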

III. Results

A. Moment Invariants 

We have applied the moment invariants feature extraction
method to twenty sequence-encoded time series signals, ten
human and ten mouse. These signals are plotted in figure (2).
Figure (2.a) shows the ten human time series signals and
figure (2.b) shows the ten mouse time series signals. It is
clear from figure (2.b) that the ten mouse signals are very
similar but not identical. As described in the methodology
section, after calculating the seven features extracted by
moment invariants, the means of the human and mouse features
are taken and plotted against each other, as shown in
figure (3). It is clear from this figure that the third
feature is the most discriminative feature.
Figure (2), the twenty human and mouse time series signals:
(a) the ten human time series signals; (b) the ten mouse time
series signals.

Figure (3), a comparison between μh and μm, the mean human
and mouse moment invariant features.

B. Correlation Dimension 

We have applied the Takens estimator to the encoded
sequences in order to obtain their correlation dimension
estimates. The following table lists the fractional (chaotic)
correlation dimension estimates of a set of human genes and
mouse genes obtained from [10].
Table (1), Correlation Dimension (D2) estimates of
the translated gene sequences dataset.

Human D2     Mouse D2
4.7579       3.7796
4.8145       4.2047
3.2511       6.6643
4.7954       4.2047
4.8337       4.5172
3.25         2.6328
10.7250      3.5917
7.4067       3.2439
15.0129      5.1459
15.7937      10.4523

We have used these estimates to classify the encoded
sequences into their respective genomes using the K-means
clustering algorithm. The classification accuracy using
the correlation dimension estimates is given in table 3.
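The clustering step can be sketched in Python as follows. This is a plain implementation of Lloyd's K-means iteration with two clusters, offered as an illustration rather than as the classifier actually used; the function name `kmeans`, the random initialization, and the seed are assumptions. The D2 values are those listed in Table (1).

```python
import numpy as np


def kmeans(features, k=2, iters=100, seed=0):
    """Lloyd's K-means. Returns (labels, centroids)."""
    x = np.asarray(features, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D list as n samples with one feature
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        # Keep the old centroid if a cluster happens to become empty.
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dist, axis=1), centroids


# D2 estimates from Table (1).
human_d2 = [4.7579, 4.8145, 3.2511, 4.7954, 4.8337,
            3.25, 10.7250, 7.4067, 15.0129, 15.7937]
mouse_d2 = [3.7796, 4.2047, 6.6643, 4.2047, 4.5172,
            2.6328, 3.5917, 3.2439, 5.1459, 10.4523]
labels, centroids = kmeans(human_d2 + mouse_d2, k=2)
print(labels[:10], labels[10:])  # cluster index per human / mouse sequence
```

K-means assigns arbitrary cluster indices, so computing a per-genome accuracy requires a rule for matching clusters to genomes. That rule is not specified here and is left out of the sketch.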

C. Largest Lyapunov Exponent 

The largest Lyapunov exponent (LLE) has been estimated
manually from the scaling (linear increase) of the prediction
error versus the prediction time using the TSTOOL algorithm.
The following table lists the LLE estimates of a set of
human genes and mouse genes.



Table (2), Largest Lyapunov Exponent (lambda1) estimates of
the translated gene sequences dataset.

Human LLE                  Mouse LLE
0.000778 (D2=4.7579)       0.003 (D2=3.7796)
0.0003347 (D2=4.8145)      0.001 (D2=6.6643)
0.0001666 (D2=3.2511)      0.00146 (D2=6.66)
0.000137 (D2=3.25)         0.0010398 (D2=10.4523)
0.000428 (D2=4.7954)       0.00078 (D2=3.2439)
0.0001544 (D2=4.8337)      0.000493 (D2=3.5917)
0.00045 (D2=10.7250)       0.00033 (D2=4.5172)
0.000406 (D2=7.4067)       0.0005 (D2=3.59)
0.000262 (D2=15.7937)      0.00140 (D2=6.6)
0.000137 (D2=15.0129)      0.0030 (D2=3.77)

Furthermore, we have applied the K-means clustering 
algorithm to classify the sequences into their respective 
genomes. The classification accuracy using the LLE is 
depicted in table 3. 



Table (3), accuracy of the proposed feature extraction methods with the
K-means clustering classifier.

         Moment        Correlation    Lyapunov
         invariants    dimension      exponent
Human    80%           40%            100%
Mouse    100%          80%            60%

IV. Discussion and Conclusion 

In this work, we have characterized the biological
sequences based on their nonlinear dynamical behavior. That
is, we have established a nonlinear dynamical model consisting
of the moment invariant, correlation dimension (D2), and
largest Lyapunov exponent (lambda1) estimates of sequences
encoded with a plain integer mapping. The pattern of this
model's parameters varied considerably between the different
genomes. Furthermore, we have used the K-means clustering
algorithm to classify the different sequences into
their respective genomes.

Experiments were performed on a dataset obtained from
[10] to evaluate the reliability of the proposed nonlinear
dynamical model. The proposed model yielded reasonable
classification accuracy between the human sequences and the
mouse sequences. Nonetheless, because some of the human
sequences are similar to some of the mouse sequences, the
model yielded low classification accuracy in some cases.
Longer sequences (more than 300 bases) are therefore
required to improve the performance of the proposed model on
such similar sequences.

In conclusion, throughout this work we have found that the
natural nonlinear dynamics of the biological sequences
differ between the different species. It is therefore
encouraging that the different species can be distinguished
according to the nonlinear dynamical characteristics
of their respective translated gene sequences.